We start loading the necessary libraries, we use:
- Seurat v3.2.1
- patchwork v1.0.1
- dplyr v1.0.2
- ggplot2 v3.3.2
library(Seurat)
library(patchwork)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Load the count matrix, which was downloaded as an R object from https://panglaodb.se/view_data.php?sra=SRA667466&srs=SRS3059941
raw_counts <- get(load('SRA667466_SRS3059941.sparse.RData'))
In order to keep only the gene symbols and remove the ENSEMBL names we’ll intervene directly on the count matrix before creating the Seurat object. Therefore we start creating an helpful function and applying it to the row names of the matrix:
no_ens <- function(string) {
gene_name <- unlist(strsplit(string, split='_ENS', fixed=TRUE))[1]
gene_name <- toupper(gene_name)
gene_name
}
gene_names <- unlist(lapply(rownames(raw_counts), no_ens))
rownames(raw_counts) <- gene_names
head(rownames(raw_counts))
## [1] "00R_AC107638.2" "0610009O20RIK" "1110020A21RIK" "1600014C23RIK"
## [5] "1700007L15RIK" "1700015F17RIK"
We can now create the Seurat object. With the options min.cells and min.features, we can already pre-filter the cells and the genes while creating the object. In particular here we keep:
- Only genes detected in at least 3 cells (min.cells = 3)
- Only the cells where are detected at least 200 genes (min.features = 200)
ct <- CreateSeuratObject(counts = raw_counts, project = "spinal cord", min.cells = 3, min.features = 200)
## Warning: Non-unique features (rownames) present in the input matrix, making
## unique
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
ct
## An object of class Seurat
## 21942 features across 5129 samples within 1 assay
## Active assay: RNA (21942 features, 0 variable features)
For each cell, the Seurat object reports its metadata that at first include the number of counts (nCounts_RNA) and the number of genes detected (nFeatures_RNA). We want to add also the percentage of mitochondrial genes:
| orig.ident | nCount_RNA | nFeature_RNA | percent.mt | |
|---|---|---|---|---|
| AAACATACAAAGCA | spinal cord | 1599 | 934 | 2.8767980 |
| AAACATACACCTTT | spinal cord | 4924 | 2270 | 0.8123477 |
| AAACATACGGGTGA | spinal cord | 5922 | 2311 | 0.7429922 |
| AAACATACGTACCA | spinal cord | 5261 | 2032 | 0.6272572 |
| AAACATACGTAGGG | spinal cord | 4566 | 1709 | 0.5913272 |
| AAACATACTGTGAC | spinal cord | 3507 | 1711 | 0.5702880 |
| AAACATTGTCGTAG | spinal cord | 1624 | 1109 | 0.5541872 |
| AAACATTGTGATGC | spinal cord | 324 | 242 | 2.7777778 |
| AAACGCACGAAACA | spinal cord | 3592 | 1478 | 0.0000000 |
| AAACGCTGAGAGGC | spinal cord | 3009 | 1705 | 0.5317381 |
We can now plot the distribution of these 3 values with the VlnPlot function:
VlnPlot(ct, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size=0, cols = mint_color)
From this graph we can observe two spikes in the frequency of number of detected genes, one at around 800 and one at around 1800-1900. The counts number is variable as well and we can observe a higher spike at low counts (around 1000) and a second spike at around 5000.
The percentage of mithocondrial genes is low on average with a single spike at around 1%
We can plot the correlations between the number of counts and the percentage of mitochondrial genes as well as the correlation between the number of counts and the number of genes detected. This is useful to better understand the relationship between these variable, in fact none of these metadata columns is really indicative of the quality of the cell if taken alone.
plot1 <- FeatureScatter(ct, feature1 = "nCount_RNA", feature2 = "percent.mt", cols = mint_color)
plot2 <- FeatureScatter(ct, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", cols = mint_color)
plot1 + plot2
There’s a small inverse correlation between the number of counts and the percentage of mitochondrial genes (Pearson correlation = -0.34). We can observe that at high percentage of mitochondrial RNA detected, the counts are very low, indicating that these values are attributabl e to a leakage of material from a damaged cell.
A high and direct correlation can instead be observed between the number of counts and the number of gene detected with a Pearson correlation coefficient of 0.93. We can suppose that high value for both of these features are due to doublets.
After observing these graphs, we selected the threshold useful to choose a subset of the cells. In this case we’ll keep cells with:
- more than 300 genes detected but less than 3500, so to exclude cells that have leaked or problematic cells (with less than 300 genes) and doublets (with more than 3500 genes)
- a percentage of mitochondrial genes lower than 3%, so to exclude those cells that had leaked or with other problems
- a total number of counts higher than 500 (cells with a significative number of counts)
These threshold have been set not to be extremely strict in order to retain as much biological information as possible.
ct <- subset(ct, subset = nFeature_RNA > 300
& nFeature_RNA < 3500
& percent.mt < 3
& nCount_RNA > 500)
ct
## An object of class Seurat
## 21942 features across 4477 samples within 1 assay
## Active assay: RNA (21942 features, 0 variable features)
We normalize the data applying the logarithm to the counts and scaling of a factor of 10000 so we’ll have counts per 10000 instead of counts per million.
ct <- NormalizeData(ct, normalization.method = "LogNormalize", scale.factor = 10000)
Now we can select the 4000 most variable genes in order to start reducing the many dimensions of the dataset. Later on, these features will be combined in principal components when performing PCA.
ct <- FindVariableFeatures(ct, selection.method = "vst", nfeatures = 4000)
ct
## An object of class Seurat
## 21942 features across 4477 samples within 1 assay
## Active assay: RNA (21942 features, 4000 variable features)
We then identify the 10 most highly variable genes and visualize these features in a Standardized Variance - Average Expression plot
top10 <- head(VariableFeatures(ct), 10)
plot1 <- VariableFeaturePlot(ct, cols = c(black_color, mint_color) )
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot2
## Warning: Transformation introduced infinite values in continuous x-axis
We use ScaleData to center the data of each gene passed in the feature parameter so that the mean is 0 and the standard deviation is 1 for each of those.
ct <- ScaleData(ct, features = rownames(ct))
## Centering and scaling data matrix
Cell clusters may be affected by the expression of cell cycle phase-specific genes. It is also more probable due to the fact that the samples from which the cells come from are od adolescent mice. Therefore, we checked if there actually was a cell cycle effect in our cells by following the vignette at https://satijalab.org/seurat/v3.2/cell_cycle_vignette.html
We start by computing the cell cycle score for the phases S and G2:
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes
ct <- CellCycleScoring(ct, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)
## Warning: The following features are not present in the object: UNG, UHRF1,
## MLF1IP, UBR7, USP1, not searching for symbol synonyms
## Warning: The following features are not present in the object: UBE2C, not
## searching for symbol synonyms
Then we can run a PCA on cell cycle genes to determine whether cells separate by phase:
ct <- RunPCA(ct, features = c(s.genes, g2m.genes))
## Warning in PrepDR(object = object, features = features, verbose = verbose): The
## following 6 features requested have not been scaled (running reduction without
## them): UNG, UHRF1, MLF1IP, UBR7, USP1, UBE2C
## Warning in irlba(A = t(x = object), nv = npcs, ...): You're computing too large
## a percentage of total singular values, use a standard svd instead.
## PC_ 1
## Positive: BIRC5, CDK1, TOP2A, CDCA8, CENPA, MKI67, CDCA3, CCNB2, CDC20, HMGB2
## CENPF, CKS2, FAM64A, RRM2, CKS1B, NUSAP1, CKAP2, TACC3, CKAP2L, CDCA2
## TPX2, CENPE, AURKA, KIF23, SMC4, TMPO, AURKB, ECT2, HN1, HMMR
## Negative: FEN1, ANLN, CCNE2, BLM, WDR76, CASP8AP2, POLD3, CDCA7, DLGAP5, POLA1
## HELLS, LBR, CBX5, GTSE1, CTCF, CLSPN, EXO1, E2F8, DSCC1, CDC6
## CKAP5, RFC2, MSH2, CHAF1B, SLBP, DTL, G2E3, GAS2L3, MCM2, PRIM1
## PC_ 2
## Positive: CENPA, TACC3, CDC20, CDCA3, CENPF, CCNB2, CENPE, FAM64A, CKS2, BIRC5
## AURKA, G2E3, CKAP2, CKAP5, HMMR, KIF20B, PSRC1, KIF11, KIF23, CDC25C
## NEK2, NUF2, ANLN, CDCA8, ECT2, HJURP, BUB1, CDCA2, NDC80, KIF2C
## Negative: MCM6, PCNA, RPA2, MCM2, TIPIN, GINS2, GMNN, MCM5, ATAD2, RRM2
## CHAF1B, MCM4, NASP, PRIM1, HELLS, TYMS, RFC2, POLA1, CDC45, HMGB2
## RAD51AP1, MSH2, CDC6, RAD51, TMPO, TOP2A, CDCA7, SLBP, POLD3, DSCC1
## PC_ 3
## Positive: ANLN, RFC2, SMC4, PCNA, LBR, G2E3, MCM4, NCAPD2, CDC45, CASP8AP2
## RPA2, AURKA, CDCA8, CDCA3, GMNN, TOP2A, RRM1, GINS2, KIF11, POLA1
## KIF20B, BRIP1, CDC6, BLM, NDC80, CDCA2, WDR76, RRM2, CDC25C, CTCF
## Negative: HN1, CBX5, TYMS, TMPO, RANGAP1, CKAP5, TIPIN, MSH2, POLD3, MCM6
## HELLS, NUSAP1, HMGB2, NASP, CDCA7, RAD51, CKAP2, ANP32E, TPX2, GAS2L3
## TUBB4B, NUF2, MCM2, MKI67, PRIM1, DLGAP5, DTL, SLBP, CCNE2, KIF23
## PC_ 4
## Positive: RANGAP1, TUBB4B, CBX5, RFC2, HN1, PCNA, ANLN, POLD3, SLBP, ANP32E
## NUSAP1, MKI67, RPA2, MSH2, CASP8AP2, PRIM1, MCM4, NUF2, CDC20, CKAP5
## POLA1, BLM, HMMR, G2E3, CCNE2, TACC3, NCAPD2, RAD51AP1, CDCA3, CDC25C
## Negative: HMGB2, TYMS, CKS2, MCM6, HJURP, TIPIN, CKS1B, CDC45, GINS2, ATAD2
## CKAP2L, CDCA7, CDCA2, HELLS, RRM1, NASP, GMNN, CKAP2, SMC4, MCM5
## AURKB, BIRC5, CDC6, WDR76, BRIP1, ECT2, KIF23, DTL, MCM2, TTK
## PC_ 5
## Positive: MSH2, GINS2, PRIM1, CKAP5, NUSAP1, RAD51, CASP8AP2, TYMS, LBR, CKS1B
## NASP, RAD51AP1, CDCA8, RRM2, BIRC5, NCAPD2, CTCF, AURKB, ANLN, CDC20
## BLM, ATAD2, PCNA, CKS2, CDK1, EXO1, CBX5, CLSPN, CCNB2, CDC45
## Negative: RRM1, RPA2, ANP32E, MCM6, SLBP, TIPIN, HJURP, MCM5, TACC3, KIF23
## G2E3, PSRC1, CENPE, KIF20B, CDCA7, CKAP2, TUBB4B, GAS2L3, TMPO, NEK2
## HN1, CDC25C, RFC2, HELLS, MKI67, CENPF, NDC80, TPX2, CDCA3, SMC4
DimPlot(ct, dims = c(1, 2), cols = cell_cycle_colors)
DimPlot(ct, dims = c(3, 4), cols = cell_cycle_colors)
DimPlot(ct, dims = c(5, 6), cols = cell_cycle_colors)
Cells do not appear to cluster differently based on cell cycle, therefore it won’t be necessary to regress its effect out.
In order to reduce the dimensions of the matrix we can perform the PCA on the 4000 most variable genes and select a certain number of principal components.
ct <- RunPCA(ct, features = VariableFeatures(object = ct))
## PC_ 1
## Positive: TSPAN13, EBF1, TSC22D1, MEG3, LY6H, CAMK2N1, SNHG11, EGFL7, SSBP4, SPOCK2
## HOXB8, VSNL1, BSG, JUNB, SNAP25, STMN2, HPRT, SNCB, FXYD7, TAGLN3
## TSHZ2, PCP4L1, CALY, CELF4, STMN3, RESP18, GAP43, LY6E, IGFBP7, RTN1
## Negative: MOG, MAG, MOBP, JOSD2, PPP1R14A, TRF, SHISA4, PDLIM2, TMEFF2, CMTM5
## MAL, MBP, ERMN, GSN, NKX6-2, CERS2, GAMT, GJB1, TMEM151A, FA2H
## SEPT4, APLP1, GJC2, TUBB4A, GLTP, ELOVL1, TSPAN15, TSPAN2, TMEM125, RNF13
## PC_ 2
## Positive: SNHG11, PCSK1N, NAP1L5, LY6H, STMN3, SNAP25, SNCB, VSNL1, BEX2, MEG3
## SNRPN, FXYD7, HOXB8, SCG5, RTN1, RP23-83I13.10, CALY, RESP18, CELF4, RAB3A
## GNG3, PDXP, MLLT11, CHD3OS, ZWINT, GAP43, SCG2, BEX1, ATP6V1G2, RGS17
## Negative: CLDN5, LY6C1, PGLYRP1, FLT1, IGFBP7, ITM2A, ESAM, KLF2, SLCO1C1, CD34
## LY6A, ENG, ANXA3, ID1, CAR4, CTLA2A, RAMP2, IFITM3, KDR, SRGN
## GRAP, SPARC, CTSH, SLC22A8, SLCO1A4, VIM, EMCN, VWA1, ID3, COL4A2
## PC_ 3
## Positive: ALDOC, APOE, GPR37L1, SLC6A11, SLC1A2, AQP4, AGT, NTSR2, RAMP1, ATP1A2
## NDRG2, PLA2G7, FGFR3, CST3, SLC4A4, RP23-168E14.7, MT2, GJA1, CXCL14, ATP1B2
## ACSBG1, HTRA1, BCAN, CLU, ENHO, CLDN10, TIMP4, TRIL, MT3, MLC1
## Negative: FKBP1A, MYL6, TMSB10, CALM2, RAB3A, SNRPN, STMN4, RTN4, SEPT4, MAPK3
## TALDO1, TMEM151A, SLC9A3R2, ACOT7, APLP1, RAP1A, TPPP3, ELAVL3, SLC48A1, TPD52
## ARPC1A, MAL, MOG, TUBA1A, ARPC1B, TUBB4A, AGPAT4, KLF13, TMEFF2, PHLDA3
## PC_ 4
## Positive: ATP1A2, SPARCL1, HTRA1, MT2, ATP1B2, SLC1A2, SLC6A11, GJA1, AQP4, GPR37L1
## AGT, FGFR3, MT1, ALDOC, NTSR2, ENHO, CLU, SLC4A4, TTYH1, MT3
## PLA2G7, RP23-168E14.7, BCAN, NDRG2, SLC6A1, CXCL14, CLDN10, KCNJ10, TRIL, MMD2
## Negative: C1QA, CTSS, C1QC, TYROBP, SELPLG, C1QB, CSF1R, TMEM119, TREM2, P2RY12
## PLD4, FCER1G, HEXB, AIF1, FCRLS, GPR34, LY86, BIN2, CX3CR1, RNASE4
## CD14, FCGR3, CYTH4, LAPTM5, SPI1, IRF8, BCL2A1B, CCL3, FGD2, FAM105A
## PC_ 5
## Positive: NEU4, GPR17, MATN4, COL9A3, OPCML, C1QL1, FYN, MYCL, PAK4, RP23-95H5.1
## MARCKS, VCAN, MPZL1, BMP4, SH3BP4, AC103620.3, OLIG2, IGSF9B, TRAF4, SCRG1
## PYCR1, PXDC1, TCF7L2, LHFPL3, LIMS2, RP23-43M12.1, BFSP2, RP23-47H16.3, CD9, RTN1
## Negative: GLUL, CAR2, LDHB, CHCHD10, SLC6A11, MT1, AQP4, AGT, GJA1, PPP1R14A
## MAL, APOD, MT2, GJB1, NTSR2, NDRG1, PLA2G7, HIST1H2BC, SLC4A4, TRF
## ERMN, ITIH3, PLA2G16, ATP1B2, ACSBG1, FGFR3, GPD1, GM5514, ENHO, STMN4
DimPlot(ct, reduction = "pca", cols = cell_cycle_colors)
DimPlot(ct, reduction = "pca", dims = c(3,4), cols = cell_cycle_colors)
DimPlot(ct, reduction = "pca", dims = c(5,6), cols = cell_cycle_colors)
We can look at the top genes associated with reduction components:
VizDimLoadings(ct, dims = 1:2, reduction = "pca", col = mint_color)
We can also visualize for the first 500 cells how the genes are differentially expressed.
DimHeatmap(ct, dims = 1:6, cells = 500, balanced = TRUE)
We can spot a high difference between one cell and the other.
We can determine the number of principal components to use with two methods:
- by looking at the elbow plot
- determine percent of variation associated with each PC and determine the last PC where change of percentage of variation is more than 0.05%.
Elbow plot:
ElbowPlot(ct, ndims = 50)
Determine the last PC where change of percentage of variation is more than 0.05%:
pct <- (ct@reductions$pca@stdev / sum(ct@reductions$pca@stdev) ) * 100
cum <- cumsum(pct)
co2 <- sort(which((pct[1:length(pct)-1] - pct[2:length(pct)]) > 0.05), decreasing = T)[1] + 1 # last point where change of % of variation is more than 0.05%.
co2
## [1] 22
The elbow plot and the quantitative methods appears to agree that the number significative PC to use is 22.
We can proceed with the clustering: first we use FindNeighbors we construct a shared nearest neighbor (SNN) graph, which exploit the k-nearest neighbors algorithm and then calculate the neighborhood overlap, and then we use FindClusters that identify clusters based on the graph just built.
Finally RunUMAP will run the Uniform Manifold Approximation and Projection, a dimentional reduction technique that is particularly useful for visualization.
ct <- FindNeighbors(ct, dims = 1:22)
## Computing nearest neighbor graph
## Computing SNN
ct <- FindClusters(ct , resolution = 1.0)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 4477
## Number of edges: 153060
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8848
## Number of communities: 22
## Elapsed time: 1 seconds
ct <- RunUMAP(ct , dims = 1:22)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 01:10:38 UMAP embedding parameters a = 0.9922 b = 1.112
## 01:10:38 Read 4477 rows and found 22 numeric columns
## 01:10:38 Using Annoy for neighbor search, n_neighbors = 30
## 01:10:38 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 01:10:40 Writing NN index file to temp file /tmp/RtmpSsfcde/file42b113fed3c8
## 01:10:40 Searching Annoy index using 1 thread, search_k = 3000
## 01:10:43 Annoy recall = 100%
## 01:10:43 Commencing smooth kNN distance calibration using 1 thread
## 01:10:45 Initializing from normalized Laplacian + noise
## 01:10:46 Commencing optimization for 500 epochs, with 178904 positive edges
## 01:10:56 Optimization finished
DimPlot(ct , reduction = "umap", combine = T)
For each computed cluster we can compute the number of cells:
# save number of cells for each cluster in a variable
ncells <- table(ct@active.ident)
ncells <- data.frame(ncells)
colnames(ncells) <- c('Cluster', 'NCells')
knitr::kable(ncells)
| Cluster | NCells |
|---|---|
| 0 | 561 |
| 1 | 543 |
| 2 | 490 |
| 3 | 351 |
| 4 | 280 |
| 5 | 272 |
| 6 | 239 |
| 7 | 221 |
| 8 | 220 |
| 9 | 220 |
| 10 | 174 |
| 11 | 165 |
| 12 | 141 |
| 13 | 123 |
| 14 | 119 |
| 15 | 85 |
| 16 | 83 |
| 17 | 72 |
| 18 | 44 |
| 19 | 41 |
| 20 | 17 |
| 21 | 16 |
In order to correctly identify each cluster, we run FindAllMarkers and select the top 3 gene markers according to their average log fold change for each cluster.
markers <- FindAllMarkers(ct, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
## Calculating cluster 13
## Calculating cluster 14
## Calculating cluster 15
## Calculating cluster 16
## Calculating cluster 17
## Calculating cluster 18
## Calculating cluster 19
## Calculating cluster 20
## Calculating cluster 21
top_markers <- markers %>% group_by(cluster) %>% top_n(n = 3, wt = avg_logFC)
knitr::kable(head(top_markers))
| p_val | avg_logFC | pct.1 | pct.2 | p_val_adj | cluster | gene |
|---|---|---|---|---|---|---|
| 0 | 1.884171 | 0.872 | 0.273 | 0 | 0 | PTGDS |
| 0 | 1.299457 | 1.000 | 0.561 | 0 | 0 | CAR2 |
| 0 | 1.204814 | 0.970 | 0.513 | 0 | 0 | QDPR |
| 0 | 3.820247 | 1.000 | 0.361 | 0 | 1 | APOE |
| 0 | 3.632850 | 0.996 | 0.251 | 0 | 1 | ALDOC |
| 0 | 3.515893 | 0.998 | 0.290 | 0 | 1 | ATP1A2 |
We can look at the expression of some of these genes in our clusters with a Violin Plot
markers_for_plots <- markers %>% group_by(cluster) %>% top_n(n = 1, wt = avg_logFC)
VlnPlot(ct, features = markers_for_plots$gene[1], pt.size = 0, cols = c(nine_colors, nine_colors, nine_colors[1:4])) + theme(axis.text.x = element_text(angle = 90), legend.position = "none")
VlnPlot(ct, features = markers_for_plots$gene[2], pt.size = 0, cols = c(nine_colors, nine_colors, nine_colors[1:4])) + theme(axis.text.x = element_text(angle = 90), legend.position = "none")
VlnPlot(ct, features = markers_for_plots$gene[3], pt.size = 0, cols = c(nine_colors, nine_colors, nine_colors[1:4])) + theme(axis.text.x = element_text(angle = 90), legend.position = "none")
VlnPlot(ct, features = markers_for_plots$gene[4], pt.size = 0, cols = c(nine_colors, nine_colors, nine_colors[1:4])) + theme(axis.text.x = element_text(angle = 90), legend.position = "none")
As we can see from these four violin plots, even if marker genes no one aloe define one cluster. We also can expect to have clusters belonging to the same cell type.
These genes can be used as input in https://panglaodb.se/search.html so to determine, according to most probable cell type and tissue, what cell types these clusters are. After having identified correctly the cell type of each cluster, we can rename and plot the clusters grouped by cell type.
ids <- c('Oligodendrocytes', 'Astrocytes', 'Oligodendrocytes', 'Oligodendrocytes', 'Oligodendrocytes Progenitor Cells', 'Endothelial Cells', 'Oligodendrocytes', 'Oligodendrocytes', 'Oligodendrocytes', 'Oligodendrocytes', 'Endothelial Cells', 'Neurons', 'Neurons', 'Neurons', 'Microglia', 'Neurons', 'Neurons', 'Neurons', 'Unknown', 'Oligodendrocytes', 'Pericytes', 'Ependymal cells')
names(ids) <- levels(ct)
ct <- RenameIdents(ct, ids)
DimPlot(ct, reduction = "umap", label = TRUE, pt.size = 0.5, cols = c("#56cc9d","#f5d491","#8cc9f8","#a899ff","#937595","#477998","#c44550","#f3969a","#81625f")) + NoLegend()
We can now explore our results. Check how many cells there are in each cluster and bind the results to the table with the top markers:
top_markers$ncells <- rep(ncells$NCells, each = 3)
top_markers$cluster <- rep(c(0:21), each = 3)
top_markers$cell_type <- rep(ids, each = 3)
knitr::kable(head(top_markers))
| p_val | avg_logFC | pct.1 | pct.2 | p_val_adj | cluster | gene | ncells | cell_type |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.884171 | 0.872 | 0.273 | 0 | 0 | PTGDS | 561 | Oligodendrocytes |
| 0 | 1.299457 | 1.000 | 0.561 | 0 | 0 | CAR2 | 561 | Oligodendrocytes |
| 0 | 1.204814 | 0.970 | 0.513 | 0 | 0 | QDPR | 561 | Oligodendrocytes |
| 0 | 3.820247 | 1.000 | 0.361 | 0 | 1 | APOE | 543 | Astrocytes |
| 0 | 3.632850 | 0.996 | 0.251 | 0 | 1 | ALDOC | 543 | Astrocytes |
| 0 | 3.515893 | 0.998 | 0.290 | 0 | 1 | ATP1A2 | 543 | Astrocytes |
We can also see for each cell type how many cells belons to it:
ncells_type <- table(ct@active.ident)
ncells_type <- data.frame(ncells_type)
colnames(ncells_type) <- c('Cell type', 'NCells')
knitr::kable(ncells_type)
| Cell type | NCells |
|---|---|
| Oligodendrocytes | 2343 |
| Astrocytes | 543 |
| Oligodendrocytes Progenitor Cells | 280 |
| Endothelial Cells | 446 |
| Neurons | 669 |
| Microglia | 119 |
| Unknown | 44 |
| Pericytes | 17 |
| Ependymal cells | 16 |
Use a DotPlot to visualize the expression of the top marker genes in different cells:
features <- markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
features <- unique(features$gene)
DotPlot(ct, features = features, cols = c('lightgrey', mint_color), cluster.idents = T) + theme(axis.text.x = element_text(angle = 90))
Similarly, we can see in which cells some genes are more or less expressed in the graph with the clusters. For this purpose we’ll select the 9 genes that are more significative.
FeaturePlot(ct, features = c('PTGDS', 'APOE', 'C1QL1', 'CLDN5', 'PCP4', 'HEXB', 'HMGB2', 'AC159092.1', 'VTN'), cols = c('lightgrey', mint_color)) + theme(axis.text.x = element_text(angle = 90))